{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\n# Torch Export with Cudagraphs\n\nCUDA Graphs allow multiple GPU operations to be launched through a single CPU operation, reducing launch overheads and improving GPU utilization. Torch-TensorRT provides a simple interface to enable CUDA graphs. This feature allows users to easily leverage the performance benefits of CUDA graphs without managing the complexities of capture and replay manually.\n\n<img src=\"file://tutorials/images/cuda_graphs.png\">\n\nThis interactive script is intended as an overview of the process by which the Torch-TensorRT Cudagraphs integration can be used in the `ir=\"dynamo\"` path. The functionality works similarly in the\n`torch.compile` path as well.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Imports and Model Definition\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\nimport torch_tensorrt\nimport torchvision.models as models"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Compilation with `torch_tensorrt.compile` Using Default Settings\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# We begin by defining and initializing a model\nmodel = models.resnet18(pretrained=True).cuda().eval()\n\n# Define sample inputs\ninputs = torch.randn((16, 3, 224, 224)).cuda()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Next, we compile the model using torch_tensorrt.compile\n# We use the `ir=\"dynamo\"` flag here, and `ir=\"torch_compile\"` should\n# work with cudagraphs as well.\nopt = torch_tensorrt.compile(\n    model,\n    ir=\"dynamo\",\n    inputs=torch_tensorrt.Input(\n        min_shape=(1, 3, 224, 224),\n        opt_shape=(8, 3, 224, 224),\n        max_shape=(16, 3, 224, 224),\n        dtype=torch.float,\n        name=\"x\",\n    ),\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Inference using the Cudagraphs Integration\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# We can enable the cudagraphs API with a context manager\nwith torch.no_grad():\n    with torch_tensorrt.runtime.enable_cudagraphs(opt) as cudagraphs_module:\n        out_trt = cudagraphs_module(inputs)\n\n    # Alternatively, we can set the cudagraphs mode for the session\n    torch_tensorrt.runtime.set_cudagraphs_mode(True)\n    out_trt = opt(inputs)\n\n    # We can also turn off cudagraphs mode and perform inference as normal\n    torch_tensorrt.runtime.set_cudagraphs_mode(False)\n    out_trt = opt(inputs)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# If we provide new input shapes, cudagraphs will re-record the graph\ninputs_2 = torch.randn((8, 3, 224, 224)).cuda()\ninputs_3 = torch.randn((4, 3, 224, 224)).cuda()\n\nwith torch.no_grad():\n    with torch_tensorrt.runtime.enable_cudagraphs(opt) as cudagraphs_module:\n        out_trt_2 = cudagraphs_module(inputs_2)\n        out_trt_3 = cudagraphs_module(inputs_3)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Cuda graphs with module that contains graph breaks\n\nWhen CUDA Graphs are applied to a TensorRT model that contains graph breaks, each break introduces additional\noverhead. This occurs because graph breaks prevent the entire model from being executed as a single, continuous\noptimized unit. As a result, some of the performance benefits typically provided by CUDA Graphs, such as reduced\nkernel launch overhead and improved execution efficiency, may be diminished.\n\nUsing a wrapped runtime module with CUDA Graphs allows you to encapsulate sequences of operations into graphs\nthat can be executed efficiently, even in the presence of graph breaks. If TensorRT module has graph breaks, CUDA\nGraph context manager returns a wrapped_module. And this module captures entire execution graph, enabling efficient\nreplay during subsequent inferences by reducing kernel launch overheads and improving performance.\n\nNote that initializing with the wrapper module involves a warm-up phase where the\nmodule is executed several times. This warm-up ensures that memory allocations and initializations are not\nrecorded in CUDA Graphs, which helps maintain consistent execution paths and optimize performance.\n\n<img src=\"file://tutorials/images/cuda_graphs_breaks.png\" scale=\"60 %\" align=\"left\">\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "class SampleModel(torch.nn.Module):\n    def forward(self, x):\n        return torch.relu((x + 2) * 0.5)\n\n\nmodel = SampleModel().cuda().eval()\ninput = torch.randn((1, 3, 224, 224)).cuda()\n\n# The 'torch_executed_ops' compiler option is used in this example to intentionally introduce graph breaks within the module.\n# Note: The Dynamo backend is required for the CUDA Graph context manager to handle modules in an Ahead-Of-Time (AOT) manner.\nopt_with_graph_break = torch_tensorrt.compile(\n    model,\n    ir=\"dynamo\",\n    inputs=[input],\n    min_block_size=1,\n    pass_through_build_failures=True,\n    torch_executed_ops={\"torch.ops.aten.mul.Tensor\"},\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If module has graph breaks, whole submodules are recorded and replayed by cuda graphs\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "with torch.no_grad():\n    with torch_tensorrt.runtime.enable_cudagraphs(\n        opt_with_graph_break\n    ) as cudagraphs_module:\n        cudagraphs_module(input)"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}